Team Members:¶
- Alua Onayeva
- Aslan Askarbek
- Rakhat Zhangabay

from jupyterquiz import display_quiz
import numpy as np
At this stage, having acquired the necessary data and a preliminary idea about the models you intend to train and implement, you might be contemplating what steps to take next. How can you ensure that your model's performance isn't merely a consequence of chance in the data selection process? How do you determine if the selected model outperforms others? Additionally, what measures should be considered if the available data is limited, potentially leading to overfitting in the models?
The answer to all these questions is simple: use cross-validation.
Cross-validation¶
Cross-validation: we randomly divide the dataset into groups (folds). The model is trained on one part of the data and tested on another. We repeat the process several times and average the metric results across iterations. This works for both classification and regression problems.
Of course, when choosing an approach you should take into account whether the dataset is small or large, balanced or unbalanced, and whether it is time-series data.
The main purpose¶
Better Performance Evaluation: Cross-validation gives a more precise estimate of the model's ability to generalize to unseen data than a single train-test split.
Hyperparameter Tuning: Through cross-validation, it can be seen how the model performs across several folds, which helps identify hyperparameters with better performance.
Overfitting Avoidance: Hyperparameter tuning without cross-validation might lead to overfitting to a specific train-test split. Cross-validation mitigates this risk by evaluating hyperparameters across various data subsets, ensuring better generalization.
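As a minimal sketch of the idea (a synthetic dataset from `make_classification` stands in for real data, and the variable names are illustrative), `cross_val_score` trains and scores a model on each of 5 folds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for a real dataset here
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# cv=5: train on 4 folds, score on the held-out fold, repeated 5 times
scores = cross_val_score(LogisticRegression(max_iter=500), X_demo, y_demo, cv=5)
print(scores)         # one accuracy per fold
print(scores.mean())  # the averaged estimate of generalization
```

A single train-test split would report just one of these numbers; the spread across folds shows how much that number depends on the split.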
display_quiz("#qqq1")
Train and Test splits¶
Idea: Randomly divide the data into training and test data, the same for all models. The quality of the models and their resistance to overfitting are checked on the test data. This is a common choice and a quick validation method.
Commonly used values: 80% training and 20% test, 70% training and 30% test.
However, this approach has a major weakness: the result depends directly on which data ended up in the train group and which in the test group. The following approaches solve this problem.
import warnings
warnings.filterwarnings("ignore")
We will use the Titanic disaster dataset. It includes information on passenger class, name, sex, age, number of siblings/spouses and parents/children aboard, ticket number, fare, cabin, port of embarkation, and survival for 891 passengers.
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('./dataset/train.csv')
df['Sex'] = df['Sex'].apply(lambda x: 1 if x == 'male' else 0)
df.loc[df['Age'].isna(), 'Age'] = -1
X = df[['Sex', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=51)
print(X_train.shape)
print(X_test.shape)
(712, 6)
(179, 6)
df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
As you can see below, the distribution of the classes is approximately the same, so this dataset can be considered more or less balanced.
y.value_counts()
Survived
0    549
1    342
Name: count, dtype: int64
Types of Cross-validation¶
K-Fold¶
The main idea of this approach is to split the whole dataset into $K$ parts of equal size; each partition is called a fold.
One fold is used for validation, and the other $K-1$ folds are used for training the model. The procedure is repeated $K$ times so that each fold is used as the validation set exactly once, with the remaining folds serving as the training set. As a result, every observation is used in both the train and test groups.

Standard values:
Most commonly, the number of folds used is 5 or 10.
This validation technique is not considered suitable for imbalanced datasets, because the folds may not preserve the class ratio and the model will not be trained on a representative share of each class. This issue can be resolved using Stratified K-Fold, described below.
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
kf = KFold(n_splits=5, shuffle=True, random_state=42)
count = 1
for train_index, test_index in kf.split(X, y):
    print(f'Fold:{count}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    count += 1
Fold:1, Train set: 712, Test set:179
Fold:2, Train set: 713, Test set:178
Fold:3, Train set: 713, Test set:178
Fold:4, Train set: 713, Test set:178
Fold:5, Train set: 713, Test set:178
Now we will apply this method on different models, evaluate accuracy for every fold and output the mean.
Logistic Regression¶
from sklearn.linear_model import LogisticRegression
score = cross_val_score(LogisticRegression(random_state=42), X, y, cv=kf, scoring="accuracy")
print(f'Scores for each fold are: {score}')
print(f'Average score: {score.mean():.2f}')
Scores for each fold are: [0.78212291 0.78089888 0.83146067 0.76404494 0.79213483]
Average score: 0.79
[!NOTE] As you can see, there is significant variation between the fold scores (76%–83%); imagine relying on a single train-test split in this case.
Random Forest¶
from sklearn.ensemble import RandomForestClassifier
score = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=kf, scoring="accuracy")
print(f'Scores for each fold are: {score}')
print(f'Average score: {score.mean():.2f}')
Scores for each fold are: [0.83240223 0.79775281 0.81460674 0.80337079 0.83707865]
Average score: 0.82
Gradient Boosting¶
from sklearn.ensemble import GradientBoostingClassifier
score = cross_val_score(GradientBoostingClassifier(random_state=42), X, y, cv=kf, scoring="accuracy")
print(f'Scores for each fold are: {score}')
print(f'Average score: {score.mean():.2f}')
Scores for each fold are: [0.80446927 0.81460674 0.87078652 0.78651685 0.82022472]
Average score: 0.82
Q. Why is it important to use cross-validation with Gradient Boosting?
Because of how gradient boosting works, it tends to overfit rapidly: at each iteration, boosting fits the errors of the previous iterations, reducing the training error step by step until the model's stopping criterion is met. To choose the best stopping criterion, cross-validation should be applied.
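As a sketch of this, the number of boosting iterations (`n_estimators`, which acts as the stopping point) can be chosen by cross-validating each candidate value; the synthetic dataset and variable names below are illustrative stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validate each candidate number of boosting iterations;
# the training error keeps falling with more iterations, but the
# cross-validated score reveals when extra iterations stop helping.
for n in [25, 50, 100, 200]:
    scores = cross_val_score(
        GradientBoostingClassifier(n_estimators=n, random_state=42),
        X_demo, y_demo, cv=kf_demo, scoring="accuracy")
    print(n, round(scores.mean(), 3))
```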
display_quiz("#qqq2")
display_quiz("#qqq3")
KFold Model Tuning¶
Logistic Regression¶
Applying different algorithms in Logistic Regression.
algorithms = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
warnings.filterwarnings('ignore')
for algo in algorithms:
    score = cross_val_score(LogisticRegression(max_iter=500, solver=algo, random_state=42), X, y, cv=kf, scoring="accuracy")
    print(f'Average score({algo}): {score.mean():.3f}')
Average score(newton-cg): 0.790
Average score(lbfgs): 0.790
Average score(liblinear): 0.792
Average score(sag): 0.725
Average score(saga): 0.708
Random Forest¶
Trying different values for the maximum number of leaf nodes.
max_leaf_nodes = [None, 5, 10, 15, 20]
for val in max_leaf_nodes:
    score = cross_val_score(RandomForestClassifier(max_leaf_nodes=val, random_state=42), X, y, cv=kf, scoring="accuracy")
    print(f'Average score({val}): {score.mean():.3f}')
Average score(None): 0.817
Average score(5): 0.800
Average score(10): 0.811
Average score(15): 0.815
Average score(20): 0.817
Gradient Boosting¶
We can also iterate over several parameters at once using GridSearchCV.
from sklearn.model_selection import GridSearchCV
params = {
    'n_estimators': [50, 100],
    'max_depth': [3, 5, 7],
}
grid_search = GridSearchCV(GradientBoostingClassifier(random_state=42), params, cv=kf, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print(best_params)
{'max_depth': 3, 'n_estimators': 100}
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)
print(predictions)
[0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 1 0 1 1 1 1 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 1 0 1 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 1 0 1]
Stratified K-Fold¶
The name comes from the term stratum: subjects are divided into subgroups called strata based on characteristics that they share (e.g., race, gender, educational attainment). Once divided, each subgroup is sampled randomly.
This is an enhanced version of the k-fold cross-validation technique. It too splits the dataset into $k$ equal folds, but each fold has the same ratio of target classes as the complete dataset, helping each fold generalize.
This makes it well suited for imbalanced datasets, but not for time-series data.
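To see the stratification at work, here is a small sketch on a deliberately imbalanced, made-up target (90% class 0, 10% class 1): every test fold keeps the same 90/10 ratio as the full dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced target: 90 zeros and 10 ones
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # features are irrelevant for the split itself

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf_demo.split(X_demo, y_demo), start=1):
    # Each test fold of 20 samples contains exactly 2 positives (10%)
    print(f'Fold {fold}: positives in test = {y_demo[test_idx].sum()} of {len(test_idx)}')
```

Plain `KFold` on the same data could easily produce folds with zero positives, which is exactly the failure mode stratification prevents.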

Performance Comparison¶
import matplotlib.pyplot as plt
kf = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = [
    ('Logistic Regression', LogisticRegression()),
    ('Gradient Boosting', GradientBoostingClassifier()),
    ('Random Forest', RandomForestClassifier())
]
kfold_scores = []
stratified_kfold_scores = []
for name, model in models:
    kfold_scores.append(cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy').mean())
    stratified_kfold_scores.append(cross_val_score(model, X_train, y_train, cv=skf, scoring='accuracy').mean())
fig, ax = plt.subplots(figsize=(10, 6))
bar_width = 0.4
bar_positions_kfold = np.arange(len(models))
bar_positions_stratified_kfold = bar_positions_kfold + bar_width
ax.bar(bar_positions_kfold, kfold_scores, bar_width, label='K-Fold')
ax.bar(bar_positions_stratified_kfold, stratified_kfold_scores, bar_width, label='Stratified K-Fold')
ax.set_xticks(bar_positions_kfold + bar_width / 2)
ax.set_xticklabels([model[0] for model in models])
ax.set_xlabel('Models')
ax.set_ylabel('Mean Accuracy')
ax.set_title('Mean Accuracy of Models under Different Cross-Validation Techniques')
ax.legend()
def autolabel(bars):
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:.2%}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom')

autolabel(ax.patches)
plt.show()
Q. Comment on the results. Why do you think the results are not very different?
If you scroll up, you will find a note that the dataset is balanced, with a roughly 60/40 binary target. Therefore, the key feature of Stratified K-Fold is not needed here. However, with more imbalanced datasets it still makes sense.
display_quiz("#qqq5")
Leave-One-Out Cross Validation:¶
This approach follows the same idea as K-Fold; in fact, the LOOCV technique is identical to K-Fold with $K = N$ (the size of the whole dataset). It is worth noting the intuitive difference between K-Fold and the leave-out methods (Leave-One-Out, Leave-P-Out): in K-Fold we specify the number of groups, and the size of the groups is determined by the data size, while in leave-out methods we specify the size of the validation set, and the number of groups is determined by the data size.
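A minimal sketch with scikit-learn's `LeaveOneOut` on a small synthetic dataset (the variable names are illustrative): the model is fitted once per observation, so the number of folds equals the dataset size.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X_demo, y_demo = make_classification(n_samples=50, n_features=4, random_state=42)

# cv=LeaveOneOut(): 50 fits, each tested on a single held-out observation
scores = cross_val_score(LogisticRegression(max_iter=500), X_demo, y_demo, cv=LeaveOneOut())
print(len(scores))    # 50 — one score per observation
print(scores.mean())  # each score is 0 or 1, so the mean is an overall accuracy
```

Because it requires $N$ model fits, LOOCV is usually reserved for small datasets.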

display_quiz("#qqq4")
Learning Curves¶
import numpy as np
import plotly.graph_objs as go
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
models = {
    'Logistic Regression': LogisticRegression(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'Random Forest': RandomForestClassifier()
}

def plot_learning_curves(models, X, y, cv=None, n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    colors = ['blue', 'green', 'red']
    data = []
    color_index = 0
    for name, model in models.items():
        train_sizes, train_scores, test_scores = learning_curve(
            model, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)
        trace1 = go.Scatter(
            x=train_sizes, y=train_scores_mean,
            mode='lines+markers',
            name=f"{name} (Training score)",
            line=dict(color=colors[color_index])
        )
        trace2 = go.Scatter(
            x=train_sizes, y=test_scores_mean,
            mode='lines+markers',
            name=f"{name} (Cross-validation score)",
            line=dict(color=colors[color_index])
        )
        color_index += 1
        trace3 = go.Scatter(
            x=np.concatenate([train_sizes, train_sizes[::-1]]),
            y=np.concatenate([train_scores_mean - train_scores_std,
                              (train_scores_mean + train_scores_std)[::-1]]),
            fill='tozerox',
            fillcolor='rgba(0,100,80,0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            showlegend=False
        )
        trace4 = go.Scatter(
            x=np.concatenate([train_sizes, train_sizes[::-1]]),
            y=np.concatenate([test_scores_mean - test_scores_std,
                              (test_scores_mean + test_scores_std)[::-1]]),
            fill='tozerox',
            fillcolor='rgba(255,140,0,0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            showlegend=False
        )
        data.extend([trace1, trace2, trace3, trace4])
    layout = go.Layout(
        title='Learning Curves for Different Models',
        xaxis=dict(title='Training examples'),
        yaxis=dict(title='Score'),
        legend=dict(x=0.7, y=1.1)
    )
    fig = go.Figure(data=data, layout=layout)
    fig.show(renderer='notebook')

plot_learning_curves(models, X_train, y_train, cv=5)
In this graph you can see how each model's performance changes with the amount of training data and how the training and cross-validation scores converge.
General note comparing above approaches¶
| Approach | Execution speed | Efficient with small datasets | Efficient with large datasets |
|---|---|---|---|
| Train Test split | ✔ | ✕ | ✔ |
| K-Fold | ✕ | ✔ | ✔ |
| LOOCV | ✕ | ✔ | ✕ |
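The execution-speed column follows directly from the number of model fits each scheme performs; a quick sketch, using a dataset of the same size as the Titanic training data above:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.zeros((891, 6))  # same shape as the Titanic feature matrix above

print(KFold(n_splits=5).get_n_splits(X_demo))   # 5 model fits
print(LeaveOneOut().get_n_splits(X_demo))       # 891 model fits, one per row
```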